TODO: Refine title
Initial Questions
TODO: Must have at least two questions. It is best to have different types of problems, i.e., one regression and one classification.
Objective
TODO: Analysis: Identify the questions. What is the objective/goal of processing this dataset? What answers are you interested in finding through this dataset?
TODO: Determine the details about the dataset (e.g., title, year, purpose of the dataset, dimensions, content, structure, summary) by exploring the raw data.
TODO: Short introduction with objective of the project.
Data Cleaning and Preprocessing
TODO: Which section of the data do you need to tidy?
TODO: Prepare data for analysis by correcting the variables and contents of the data.
TODO: Putting it all together as a new cleaned/processed dataset: for this task, you are also encouraged to explore any cleaning packages in R other than those learned in the course (dplyr, tidyr, lubridate, etc.).
- Step 1 : Import libraries
- Step 2 : Import dataset
- Step 3 : Handle missing/duplicate values
- Step 4 : Preprocessing
Import libraries
# if (!require('dplyr'))
# install.packages('dplyr', repos='https://cran.asia/');
if (!require('kableExtra'))
install.packages('kableExtra', repos='https://cran.asia/');
# if (!require('lubridate'))
# install.packages('lubridate', repos='https://cran.asia/');
if (!require('plotly'))
install.packages('plotly', repos='https://cran.asia/');
if (!require('plyr'))
install.packages('plyr', repos='https://cran.asia/');
if (!require('raster'))
install.packages('raster', repos='https://cran.asia/');
if (!require('scales'))
install.packages('scales', repos='https://cran.asia/');
# if (!require('tidyquant'))
# install.packages('tidyquant', repos='https://cran.asia/');
# if (!require('tidyr'))
# install.packages('tidyr', repos='https://cran.asia/');
# library(dplyr)
library(kableExtra)
# library(lubridate)
library(plotly)
library(plyr)
library(raster)
library(scales)
# library(tidyquant)
# library('tidyr')
Import dataset
# covid_malaysia_endpoint <- 'https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/cases_malaysia.csv'
# covid_state_endpoint <- 'https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/cases_state.csv'
covid_malaysia_endpoint <- 'cases_malaysia.csv'
covid_state_endpoint <- 'cases_state.csv'
df <- read.csv(covid_malaysia_endpoint, header=TRUE)
df_state <- read.csv(covid_state_endpoint, header=TRUE)
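The commented-out endpoints and the local file names suggest the data can be read from either source. A minimal sketch of choosing between them, assuming a hypothetical helper `resolve_source()` that is not part of the original analysis: it prefers a local copy when one exists and otherwise falls back to the remote URL.

```r
# Hypothetical helper (not in the original report): prefer a local CSV,
# fall back to the remote endpoint when the local file is absent.
resolve_source <- function(local_path, remote_url) {
  if (file.exists(local_path)) local_path else remote_url
}

base_url <- 'https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/'
covid_malaysia_source <- resolve_source('cases_malaysia.csv', paste0(base_url, 'cases_malaysia.csv'))
covid_state_source <- resolve_source('cases_state.csv', paste0(base_url, 'cases_state.csv'))

# read.csv() accepts both file paths and URLs, so either result can be passed on:
# df <- read.csv(covid_malaysia_source, header = TRUE)
```

Keeping a local copy makes the report reproducible when the upstream files change, while the fallback keeps it runnable on a fresh machine.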
# Check the structure of the dataframe
str(df)
## 'data.frame': 708 obs. of 31 variables:
## $ date : chr "2020-01-25" "2020-01-26" "2020-01-27" "2020-01-28" ...
## $ cases_new : int 4 0 0 0 3 1 0 0 0 0 ...
## $ cases_import : int 4 0 0 0 3 1 0 0 0 0 ...
## $ cases_recovered : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_active : int 4 4 4 4 7 8 8 8 8 8 ...
## $ cases_cluster : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_unvax : int 4 0 0 0 3 1 0 0 0 0 ...
## $ cases_pvax : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_fvax : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_boost : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_child : int 0 0 0 0 1 0 0 0 0 0 ...
## $ cases_adolescent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_adult : int 1 0 0 0 2 1 0 0 0 0 ...
## $ cases_elderly : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_0_4 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ cases_5_11 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_12_17 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_18_29 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_30_39 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ cases_40_49 : int 1 0 0 0 0 1 0 0 0 0 ...
## $ cases_50_59 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ cases_60_69 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_70_79 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_80 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cluster_import : int NA NA NA NA NA NA NA NA NA NA ...
## $ cluster_religious : int NA NA NA NA NA NA NA NA NA NA ...
## $ cluster_community : int NA NA NA NA NA NA NA NA NA NA ...
## $ cluster_highRisk : int NA NA NA NA NA NA NA NA NA NA ...
## $ cluster_education : int NA NA NA NA NA NA NA NA NA NA ...
## $ cluster_detentionCentre: int NA NA NA NA NA NA NA NA NA NA ...
## $ cluster_workplace : int NA NA NA NA NA NA NA NA NA NA ...
str(df_state)
## 'data.frame': 11344 obs. of 25 variables:
## $ date : chr "2020-01-25" "2020-01-25" "2020-01-25" "2020-01-25" ...
## $ state : chr "Johor" "Kedah" "Kelantan" "Melaka" ...
## $ cases_new : int 4 0 0 0 0 0 0 0 0 0 ...
## $ cases_import : int 4 0 0 0 0 0 0 0 0 0 ...
## $ cases_recovered : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_active : int 4 0 0 0 0 0 0 0 0 0 ...
## $ cases_cluster : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_unvax : int 4 0 0 0 0 0 0 0 0 0 ...
## $ cases_pvax : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_fvax : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_boost : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_child : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_adolescent: int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_adult : int 1 0 0 0 0 0 0 0 0 0 ...
## $ cases_elderly : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_0_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_5_11 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_12_17 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_18_29 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_30_39 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_40_49 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ cases_50_59 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_60_69 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_70_79 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cases_80 : int 0 0 0 0 0 0 0 0 0 0 ...
# Check the dimension of the dataframe
dim(df)
## [1] 708 31
dim(df_state)
## [1] 11344 25
# Check the first 6 rows
head(df) %>% kable('html') %>% kable_styling(font_size = 12)
| date | cases_new | cases_import | cases_recovered | cases_active | cases_cluster | cases_unvax | cases_pvax | cases_fvax | cases_boost | cases_child | cases_adolescent | cases_adult | cases_elderly | cases_0_4 | cases_5_11 | cases_12_17 | cases_18_29 | cases_30_39 | cases_40_49 | cases_50_59 | cases_60_69 | cases_70_79 | cases_80 | cluster_import | cluster_religious | cluster_community | cluster_highRisk | cluster_education | cluster_detentionCentre | cluster_workplace |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-25 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-26 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-27 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-28 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-29 | 3 | 3 | 0 | 7 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-30 | 1 | 1 | 0 | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
head(df_state) %>% kable('html') %>% kable_styling(font_size = 12)
| date | state | cases_new | cases_import | cases_recovered | cases_active | cases_cluster | cases_unvax | cases_pvax | cases_fvax | cases_boost | cases_child | cases_adolescent | cases_adult | cases_elderly | cases_0_4 | cases_5_11 | cases_12_17 | cases_18_29 | cases_30_39 | cases_40_49 | cases_50_59 | cases_60_69 | cases_70_79 | cases_80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-25 | Johor | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2020-01-25 | Kedah | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-01-25 | Kelantan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-01-25 | Melaka | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-01-25 | Negeri Sembilan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-01-25 | Pahang | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Examine the summary statistics
summary(df) %>% kable('html') %>% kable_styling(font_size = 12)
| date | cases_new | cases_import | cases_recovered | cases_active | cases_cluster | cases_unvax | cases_pvax | cases_fvax | cases_boost | cases_child | cases_adolescent | cases_adult | cases_elderly | cases_0_4 | cases_5_11 | cases_12_17 | cases_18_29 | cases_30_39 | cases_40_49 | cases_50_59 | cases_60_69 | cases_70_79 | cases_80 | cluster_import | cluster_religious | cluster_community | cluster_highRisk | cluster_education | cluster_detentionCentre | cluster_workplace | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:708 | Min. : 0.0 | Min. : 0.00 | Min. : 0 | Min. : 1 | Min. : 0.0 | Min. : 0.0 | Min. : 0 | Min. : 0.0 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.00 | Min. : 0 | Min. : 0.0000 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.00 | Min. : 0.00 | Min. : 14.0 | |
| Class :character | 1st Qu.: 53.5 | 1st Qu.: 3.00 | 1st Qu.: 51 | 1st Qu.: 1212 | 1st Qu.: 17.0 | 1st Qu.: 53.5 | 1st Qu.: 0 | 1st Qu.: 0.0 | 1st Qu.: 0.00 | 1st Qu.: 2.0 | 1st Qu.: 3.0 | 1st Qu.: 38.75 | 1st Qu.: 4.0 | 1st Qu.: 1.0 | 1st Qu.: 1.0 | 1st Qu.: 3.0 | 1st Qu.: 16.0 | 1st Qu.: 10.0 | 1st Qu.: 6.0 | 1st Qu.: 4.0 | 1st Qu.: 2.0 | 1st Qu.: 1.00 | 1st Qu.: 0 | 1st Qu.: 0.0000 | 1st Qu.: 0.00 | 1st Qu.: 43.0 | 1st Qu.: 4.0 | 1st Qu.: 2.25 | 1st Qu.: 2.00 | 1st Qu.: 172.5 | |
| Mode :character | Median : 1542.0 | Median : 6.00 | Median : 1303 | Median : 15124 | Median : 364.0 | Median : 1205.5 | Median : 0 | Median : 0.0 | Median : 0.00 | Median : 122.0 | Median : 68.0 | Median : 1156.50 | Median : 93.5 | Median : 46.5 | Median : 75.0 | Median : 68.0 | Median : 466.5 | Median : 391.5 | Median : 197.0 | Median : 119.0 | Median : 64.5 | Median : 22.00 | Median : 8 | Median : 0.0000 | Median : 4.00 | Median :137.5 | Median : 16.0 | Median : 16.50 | Median : 37.00 | Median : 494.5 | |
| NA | Mean : 3900.4 | Mean : 12.21 | Mean : 3798 | Mean : 45991 | Mean : 693.7 | Mean : 2389.9 | Mean : 557 | Mean : 942.0 | Mean : 11.48 | Mean : 522.4 | Mean : 257.1 | Mean : 2666.92 | Mean : 348.9 | Mean : 208.0 | Mean : 314.4 | Mean : 257.1 | Mean :1006.1 | Mean : 820.5 | Mean : 494.7 | Mean : 345.7 | Mean : 223.6 | Mean : 92.29 | Mean : 33 | Mean : 0.4727 | Mean : 23.19 | Mean :198.8 | Mean : 27.3 | Mean : 37.53 | Mean : 60.47 | Mean : 624.4 | |
| NA | 3rd Qu.: 5299.5 | 3rd Qu.: 13.00 | 3rd Qu.: 5119 | 3rd Qu.: 63406 | 3rd Qu.:1088.8 | 3rd Qu.: 3322.8 | 3rd Qu.: 92 | 3rd Qu.: 290.2 | 3rd Qu.: 0.00 | 3rd Qu.: 744.8 | 3rd Qu.: 291.5 | 3rd Qu.: 3596.00 | 3rd Qu.: 554.5 | 3rd Qu.: 294.2 | 3rd Qu.: 450.8 | 3rd Qu.: 291.5 | 3rd Qu.:1243.5 | 3rd Qu.:1136.8 | 3rd Qu.: 670.2 | 3rd Qu.: 518.0 | 3rd Qu.: 364.0 | 3rd Qu.:149.00 | 3rd Qu.: 50 | 3rd Qu.: 0.0000 | 3rd Qu.: 16.00 | 3rd Qu.:300.2 | 3rd Qu.: 41.0 | 3rd Qu.: 40.00 | 3rd Qu.: 86.00 | 3rd Qu.:1049.5 | |
| NA | Max. :24599.0 | Max. :366.00 | Max. :24855 | Max. :263850 | Max. :3394.0 | Max. :12684.0 | Max. :7318 | Max. :8448.0 | Max. :305.00 | Max. :3437.0 | Max. :1820.0 | Max. :16450.00 | Max. :1986.0 | Max. :1362.0 | Max. :2091.0 | Max. :1820.0 | Max. :6374.0 | Max. :4922.0 | Max. :3132.0 | Max. :2066.0 | Max. :1231.0 | Max. :581.00 | Max. :210 | Max. :54.0000 | Max. :359.00 | Max. :825.0 | Max. :189.0 | Max. :501.00 | Max. :439.00 | Max. :2338.0 | |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :342 | NA’s :342 | NA’s :342 | NA’s :342 | NA’s :342 | NA’s :342 | NA’s :342 |
summary(df_state) %>% kable('html') %>% kable_styling(font_size = 12)
| date | state | cases_new | cases_import | cases_recovered | cases_active | cases_cluster | cases_unvax | cases_pvax | cases_fvax | cases_boost | cases_child | cases_adolescent | cases_adult | cases_elderly | cases_0_4 | cases_5_11 | cases_12_17 | cases_18_29 | cases_30_39 | cases_40_49 | cases_50_59 | cases_60_69 | cases_70_79 | cases_80 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:11344 | Length:11344 | Min. : 0.0 | Min. : 0.0000 | Min. : 0.0 | Min. : -2.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0000 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.00 | Min. : 0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.00 | Min. : 0.000 | Min. : 0.000 | |
| Class :character | Class :character | 1st Qu.: 0.0 | 1st Qu.: 0.0000 | 1st Qu.: 0.0 | 1st Qu.: 11.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.0000 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.0 | 1st Qu.: 0.00 | 1st Qu.: 0 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.0 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 0.000 | |
| Mode :character | Mode :character | Median : 19.0 | Median : 0.0000 | Median : 16.0 | Median : 274.5 | Median : 3.0 | Median : 14.0 | Median : 0.00 | Median : 0.00 | Median : 0.0000 | Median : 1.00 | Median : 1.00 | Median : 13.0 | Median : 1.00 | Median : 0 | Median : 1.00 | Median : 1.00 | Median : 4.00 | Median : 4.00 | Median : 2.00 | Median : 2.0 | Median : 1.00 | Median : 0.000 | Median : 0.000 | |
| NA | NA | Mean : 243.7 | Mean : 0.7916 | Mean : 237.3 | Mean : 2874.0 | Mean : 43.3 | Mean : 149.2 | Mean : 34.77 | Mean : 58.97 | Mean : 0.7377 | Mean : 32.64 | Mean : 16.06 | Mean : 166.6 | Mean : 21.81 | Mean : 13 | Mean : 19.64 | Mean : 16.06 | Mean : 62.85 | Mean : 51.27 | Mean : 30.91 | Mean : 21.6 | Mean : 13.97 | Mean : 5.768 | Mean : 2.063 | |
| NA | NA | 3rd Qu.: 237.0 | 3rd Qu.: 0.0000 | 3rd Qu.: 215.2 | 3rd Qu.: 2532.5 | 3rd Qu.: 40.0 | 3rd Qu.: 134.0 | 3rd Qu.: 3.00 | 3rd Qu.: 8.00 | 3rd Qu.: 0.0000 | 3rd Qu.: 29.00 | 3rd Qu.: 12.00 | 3rd Qu.: 160.0 | 3rd Qu.: 22.25 | 3rd Qu.: 11 | 3rd Qu.: 17.00 | 3rd Qu.: 12.00 | 3rd Qu.: 59.00 | 3rd Qu.: 50.00 | 3rd Qu.: 29.00 | 3rd Qu.: 22.0 | 3rd Qu.: 15.00 | 3rd Qu.: 6.000 | 3rd Qu.: 2.000 | |
| NA | NA | Max. :8792.0 | Max. :74.0000 | Max. :8801.0 | Max. :94137.0 | Max. :1545.0 | Max. :6112.0 | Max. :3890.00 | Max. :3610.00 | Max. :85.0000 | Max. :1002.00 | Max. :527.00 | Max. :6549.0 | Max. :637.00 | Max. :429 | Max. :608.00 | Max. :527.00 | Max. :2524.00 | Max. :2097.00 | Max. :1265.00 | Max. :699.0 | Max. :449.00 | Max. :157.000 | Max. :62.000 |
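The state-level summary reports a minimum of -2 for `cases_active`, which cannot be a real count. A minimal sketch of how such rows could be located, shown on a toy data frame (`toy_state` and its values are illustrative, not taken from the dataset):

```r
# Toy stand-in for df_state; the values are made up for illustration
toy_state <- data.frame(
  date = c('2020-03-01', '2020-03-01', '2020-03-02'),
  state = c('Johor', 'Kedah', 'Johor'),
  cases_new = c(3L, 0L, 1L),
  cases_active = c(5L, -2L, 6L)
)

# Counts should never be negative, so restrict the check to numeric columns
count_cols <- names(toy_state)[sapply(toy_state, is.numeric)]

# Flag rows where any count column is below zero
has_negative <- rowSums(toy_state[count_cols] < 0) > 0
toy_state[has_negative, c('date', 'state', count_cols)]
```

Whether such rows should be corrected, dropped, or kept is a judgment call that depends on how `cases_active` is used later in the analysis.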
Handle missing/duplicate values
# Check for the columns with missing values
colSums(is.na(df)) %>% kable('html') %>% kable_styling(font_size = 12)
| | x |
|---|---|
| date | 0 |
| cases_new | 0 |
| cases_import | 0 |
| cases_recovered | 0 |
| cases_active | 0 |
| cases_cluster | 0 |
| cases_unvax | 0 |
| cases_pvax | 0 |
| cases_fvax | 0 |
| cases_boost | 0 |
| cases_child | 0 |
| cases_adolescent | 0 |
| cases_adult | 0 |
| cases_elderly | 0 |
| cases_0_4 | 0 |
| cases_5_11 | 0 |
| cases_12_17 | 0 |
| cases_18_29 | 0 |
| cases_30_39 | 0 |
| cases_40_49 | 0 |
| cases_50_59 | 0 |
| cases_60_69 | 0 |
| cases_70_79 | 0 |
| cases_80 | 0 |
| cluster_import | 342 |
| cluster_religious | 342 |
| cluster_community | 342 |
| cluster_highRisk | 342 |
| cluster_education | 342 |
| cluster_detentionCentre | 342 |
| cluster_workplace | 342 |
colSums(is.na(df_state)) %>% kable('html') %>% kable_styling(font_size = 12)
| | x |
|---|---|
| date | 0 |
| state | 0 |
| cases_new | 0 |
| cases_import | 0 |
| cases_recovered | 0 |
| cases_active | 0 |
| cases_cluster | 0 |
| cases_unvax | 0 |
| cases_pvax | 0 |
| cases_fvax | 0 |
| cases_boost | 0 |
| cases_child | 0 |
| cases_adolescent | 0 |
| cases_adult | 0 |
| cases_elderly | 0 |
| cases_0_4 | 0 |
| cases_5_11 | 0 |
| cases_12_17 | 0 |
| cases_18_29 | 0 |
| cases_30_39 | 0 |
| cases_40_49 | 0 |
| cases_50_59 | 0 |
| cases_60_69 | 0 |
| cases_70_79 | 0 |
| cases_80 | 0 |
# Show the first few rows that contain missing values
head(df[rowSums(is.na(df)) > 0,]) %>% kable('html') %>% kable_styling(font_size = 12)
| date | cases_new | cases_import | cases_recovered | cases_active | cases_cluster | cases_unvax | cases_pvax | cases_fvax | cases_boost | cases_child | cases_adolescent | cases_adult | cases_elderly | cases_0_4 | cases_5_11 | cases_12_17 | cases_18_29 | cases_30_39 | cases_40_49 | cases_50_59 | cases_60_69 | cases_70_79 | cases_80 | cluster_import | cluster_religious | cluster_community | cluster_highRisk | cluster_education | cluster_detentionCentre | cluster_workplace |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-25 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-26 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-27 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-28 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-29 | 3 | 3 | 0 | 7 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-30 | 1 | 1 | 0 | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA | NA |
head(df_state[rowSums(is.na(df_state)) > 0,]) %>% kable('html') %>% kable_styling(font_size = 12)
| date | state | cases_new | cases_import | cases_recovered | cases_active | cases_cluster | cases_unvax | cases_pvax | cases_fvax | cases_boost | cases_child | cases_adolescent | cases_adult | cases_elderly | cases_0_4 | cases_5_11 | cases_12_17 | cases_18_29 | cases_30_39 | cases_40_49 | cases_50_59 | cases_60_69 | cases_70_79 | cases_80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# The rows with missing values in df can be ignored: they are all 2020 data, and the cluster_* columns are only filled in from 2021 onward.
# There are no missing values in df_state.
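The claim that the incomplete rows all belong to 2020 can be checked directly by cross-tabulating the year against a per-row missing-value flag. A minimal sketch on a toy data frame (the `toy` values are illustrative, not taken from the dataset):

```r
# Toy stand-in for df: cluster column empty in 2020, filled in 2021
toy <- data.frame(
  date = as.Date(c('2020-12-30', '2020-12-31', '2021-01-01', '2021-01-02')),
  cases_new = c(10L, 12L, 9L, 11L),
  cluster_workplace = c(NA, NA, 4L, 6L)
)

# Calendar year of each row
year <- format(toy$date, '%Y')

# TRUE for rows that contain at least one missing value
row_has_na <- rowSums(is.na(toy)) > 0

# Rows are years, columns say whether the row has any NA
na_by_year <- table(year, row_has_na)
na_by_year
```

If the same table built from `df` shows missing values only in the 2020 row, the decision to ignore them is supported by the data rather than by inspection of the first few rows alone.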
# Check for duplicate values
# duplicated() returns one logical value per row, so it must index rows (note the trailing comma)
df[duplicated(df), ]
df_state[duplicated(df_state), ]
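A duplicate check relies on row indexing: `duplicated()` returns one logical value per row, so the result belongs in the row position of `[`, before the comma. Without the comma, a data frame treats the vector as a column selector, which is why an all-`FALSE` vector yields a data frame with 0 columns. A minimal sketch on a toy data frame (names are illustrative):

```r
# Toy data frame in which the third row repeats the second
toy <- data.frame(id = c(1, 2, 2), value = c('a', 'b', 'b'))

# One logical per row: FALSE for first occurrences, TRUE for repeats
dup <- duplicated(toy)

# With the comma, the logical vector selects rows
dup_rows <- toy[dup, ]

# A simple count is often enough to confirm whether duplicates exist
n_dup <- sum(dup)
```

For the state-level table, the check has to be applied to `df_state` itself, since its row count differs from that of `df`.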
# There are no duplicated rows
Preprocessing
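Preprocessing here amounts to converting the `date` columns from character to `Date`. `as.Date()` with an explicit `format` returns `NA` for strings that do not match, without raising an error, so a quick check after conversion can be useful. A minimal sketch on made-up strings (`raw_dates` is illustrative):

```r
# Two well-formed dates and one in a different layout
raw_dates <- c('2020-01-25', '2020-01-26', '26/01/2020')

# Strings that do not match the format become NA
parsed <- as.Date(raw_dates, format = '%Y-%m-%d')

# Collect the original strings that failed to parse
bad <- raw_dates[is.na(parsed)]
bad
```

Applied to the real data, an empty `bad` vector would confirm that every date string followed the expected layout.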
df$date <- as.Date(df$date, format='%Y-%m-%d')
df_state$date <- as.Date(df_state$date, format='%Y-%m-%d')
Exploratory Data Analysis
TODO: Results may include visualization, prediction, evaluation of models and discussion of output
A Brief Look at the Graph
fig <- plot_ly(df, type = 'scatter', mode = 'lines')%>%
add_trace(x = ~date, y = ~cases_new, name = 'Daily New Covid Cases')%>%
layout(showlegend = F)
options(warn = -1)
fig <- fig %>%
layout(
xaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = '#ffff'),
yaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = '#ffff'),
plot_bgcolor='#e5ecf6', width = 1200)
fig
Density Map
df_state <- df_state %>%
mutate(date = as.Date(df_state$date, format = "%Y-%m-%d")) %>%
filter(date == as.Date('2021-09-01')) %>%
mutate(state = replace(state, state == "W.P. Kuala Lumpur", "Kuala Lumpur")) %>%
mutate(state = replace(state, state == "W.P. Labuan", "Labuan")) %>%
mutate(state = replace(state, state == "W.P. Putrajaya", "Putrajaya")) %>%
arrange(state) %>%
dplyr::rename(NAME_1 = state)
malaysia <- getData("GADM", country = "MYS", level = 1)
malaysia@data$id <- rownames(malaysia@data)
malaysia@data <- join(malaysia@data, df_state, by = "NAME_1")
malaysia_df <- fortify(malaysia)
## Regions defined for each Polygons
malaysia_df <- join(malaysia_df, malaysia@data, by = "id")
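The join on `NAME_1` only works for states whose names are spelled identically in both tables, which is why the `W.P.` prefixes are stripped first; any remaining mismatch would silently produce `NA` fills on the map. A minimal sketch of checking the keys before joining, using made-up name vectors (the spelling difference shown is illustrative, not the actual GADM naming):

```r
# State names as they might appear in the case data after renaming (illustrative)
case_states <- c('Johor', 'Kuala Lumpur', 'Labuan', 'Pulau Pinang')

# Region names as they might appear in the map data (illustrative)
map_states <- c('Johor', 'Kuala Lumpur', 'Labuan', 'Penang')

# Names present on one side only will not be matched by the join
unmatched_in_cases <- setdiff(case_states, map_states)
unmatched_in_map <- setdiff(map_states, case_states)
```

With the real objects, the same check could compare the renamed state column against `malaysia@data$NAME_1`; two empty results would mean every state receives a value.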
theme_opts <- list(theme(
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.background = element_blank(),
plot.background = element_blank(),
axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.title = element_blank()
))
# https://garthtarr.github.io/meatR/ggplot_extensions.html
# https://rstudio-pubs-static.s3.amazonaws.com/160207_ebe47475bb7744429b9bd4c908e2dc45.html
ggplot() +
geom_polygon(data = malaysia_df, aes(x = long, y = lat, group = group, fill = cases_new), color = "white", size = 0.25) +
theme(aspect.ratio = 2/5) +
scale_fill_distiller(name = "No. of New Cases", palette = "YlOrRd", direction=1, breaks = pretty_breaks(n = 5)) +
labs(title = paste('Number of New Cases in Each State on', '2021-09-01'))
Machine Learning
TODO: Results may include visualization, prediction, evaluation of models and discussion of output
Conclusion
TODO: Conclusion
Presentation and Submission
TODO: Report: The submission will be an R Markdown document published on RPubs, and the link is to be submitted in Spectrum. The R Markdown document may include the following:
- Short introduction with objective of the project.
- Explanation of all the processes involved in the project
- Results may include visualization, prediction, evaluation of models and discussion of output
- Conclusion
TODO: Only one member per group will submit the report.
TODO: Each group is required to prepare a 10-minute presentation with PowerPoint.
TODO: Both group members must present their parts.